
    Analyse morphologique non supervisée en domaine biomédical. Application à la recherche d'information

    International audience. In the biomedical domain, using specialized terms is essential for accessing information. However, in many languages these terms are complex morphological constructions that hamper this access to information. In this article, we focus on identifying the morphological components of such terms and on using them in an information retrieval (IR) task. We propose several approaches relying on automatic alignment with a particular pivot language, Japanese, and on analogy-based learning, to produce fine-grained morphological analyses of the terms of a given language. These morphological analyses are then used to improve the indexing of biomedical documents. The reported experiments show the validity of this approach, with MAP gains of more than 10% over a standard IR system.
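Analogical learning over term strings can be sketched in its simplest form: assuming the analogy a : b :: c : d reduces to a suffix swap shared between the two pairs. The medical terms below are illustrative examples, not data from the paper.

```python
def analogy_solve(a, b, c):
    """Solve a : b :: c : ? when a and b differ only by their suffixes."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1  # longest common prefix of a and b
    suffix_a, suffix_b = a[i:], b[i:]
    if not c.endswith(suffix_a):
        return None  # the analogy pattern does not apply to c
    return c[:len(c) - len(suffix_a)] + suffix_b

# gastrite : gastrique :: hepatite : ?
result = analogy_solve("gastrite", "gastrique", "hepatite")  # -> "hepatique"
```

Decompositions recovered this way (stem + derivational suffix) are the kind of morphological analyses that can then feed document indexing.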

    Topic segmentation of TV-streams by watershed transform and vectorization

    International audience. A fine-grained segmentation of radio or TV broadcasts is an essential step for most multimedia processing tasks. Applying segmentation algorithms to the speech transcripts seems straightforward; yet, most of these algorithms are not suited to short segments or noisy data. In this paper, we present a new segmentation technique inspired by the image analysis field and relying on a new way to compute similarities between candidate segments, called Vectorization. Vectorization makes it possible to match text segments that do not share common words; this property is shown to be particularly useful when dealing with transcripts in which transcription errors and short segments make segmentation difficult. This new topic segmentation technique is evaluated on two corpora of transcripts from French TV broadcasts, on which it largely outperforms other existing state-of-the-art approaches.
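A toy illustration of the Vectorization idea, under the assumption that it can be approximated by projecting each segment onto its similarities with a set of anchor texts: two segments sharing no word can still be matched if they resemble the same anchors. The anchors and segments below are invented.

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    num = sum(u[w] * v.get(w, 0) for w in u)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def vectorize(text, anchors):
    # represent a segment by its similarity to each anchor text
    bow = Counter(text.split())
    return {i: cosine(bow, Counter(a.split())) for i, a in enumerate(anchors)}

anchors = ["football match goal team stadium", "election vote president party"]
s1, s2 = "goal scored by striker", "football match crowd stadium"
direct = cosine(Counter(s1.split()), Counter(s2.split()))         # 0.0: no common word
matched = cosine(vectorize(s1, anchors), vectorize(s2, anchors))  # > 0 via the anchors
```

The two segments are lexically disjoint, yet their vectorized representations are close because both resemble the first anchor.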

    Inferring syntactic rules for word alignment through Inductive Logic Programming

    International audience. This paper presents and evaluates an original approach to automatically aligning bitexts at the word level. It relies on a syntactic dependency analysis of the source and target texts and is based on a machine-learning technique, namely inductive logic programming (ILP). We show that ILP is particularly well suited to this task, in which the data can only be expressed by (translational and syntactic) relations. It allows us to easily infer rules, called syntactic alignment rules, that make the most of the syntactic information to align words. A simple bootstrapping technique provides the examples needed by ILP, making this machine-learning approach entirely automatic. Moreover, through different experiments, we show that this approach requires a very small amount of training data and that its performance rivals some of the best existing alignment systems. Furthermore, cases of syntactic isomorphism or non-isomorphism between the source language and the target language are easily identified through the inferred rules.
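A hypothetical example of the kind of syntactic alignment rule such a system could infer: propagate an anchor alignment along dependency edges that bear the same relation in source and target. The data structures and the rule itself are simplified stand-ins, not the rules learned in the paper.

```python
def propagate(anchor_links, src_deps, tgt_deps):
    """src_deps / tgt_deps map (head, dependent) pairs to a relation label.
    Rule: if (s, t) is aligned and s -> s' and t -> t' bear the same
    relation, then align (s', t'). Iterate until no new link is added."""
    links = set(anchor_links)
    changed = True
    while changed:
        changed = False
        for (sh, sd), srel in src_deps.items():
            for (th, td), trel in tgt_deps.items():
                if srel == trel and (sh, th) in links and (sd, td) not in links:
                    links.add((sd, td))
                    changed = True
    return links

src = {("eats", "apple"): "obj", ("eats", "cat"): "subj"}
tgt = {("mange", "pomme"): "obj", ("mange", "chat"): "subj"}
links = propagate({("eats", "mange")}, src, tgt)
```

Starting from the single anchor (eats, mange), the rule aligns (apple, pomme) and (cat, chat) through the matching obj and subj edges.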

    La phonétisation comme un problÚme de translittération

    International audience. Phonetizing is a crucial step in processing oral documents. In this paper, a new word-based phonetization approach is proposed; it is automatic, simple, portable and efficient. It relies on machine learning; the system is thus built from examples of words with their phonetic representations. More precisely, it makes the most of a technique inferring rewriting rules initially developed for transliteration and translation. To evaluate the performance of this approach, we used several datasets from the Pronalsyl Pascal challenge, covering different languages and various phonetic alphabets. The obtained results equal or outperform those of the best known systems.
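Applying rewriting rules for phonetization can be sketched as greedy, longest-match-first grapheme-to-phoneme substitution. The tiny rule set below is a made-up sample for French-like spellings, not the rules the system actually infers.

```python
def phonetize(word, rules):
    """Apply grapheme->phoneme rewriting rules left to right,
    preferring the longest matching grapheme at each position."""
    out, i = [], 0
    by_length = sorted(rules.items(), key=lambda kv: -len(kv[0]))
    while i < len(word):
        for grapheme, phoneme in by_length:
            if word.startswith(grapheme, i):
                out.append(phoneme)
                i += len(grapheme)
                break
        else:
            out.append(word[i])  # no rule: copy the character as-is
            i += 1
    return "".join(out)

rules = {"eau": "o", "ch": "S", "ou": "u", "on": "O~"}
print(phonetize("chateau", rules))  # -> "Sato" under these toy rules
```

A learned system would replace this hand-written rule table with rules induced from example (word, pronunciation) pairs, possibly with contextual conditions.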

    Dimensionnalité intrinsÚque dans les espaces de représentation des termes et des documents

    National audience. Examining the properties of representation spaces for documents or words in IR (typically R^n with n large) brings precious insights to help the retrieval process. Recently, several authors have studied the real dimensionality of the datasets, called intrinsic dimensionality, in specific parts of these spaces (Houle et al., 2012a). In this paper, we propose to revisit this notion through a coefficient called α in the specific case of IR and to study its use in IR tasks. More precisely, we show how to estimate α from IR similarities and how to use it in the representation spaces used for documents and words (Mikolov et al., 2013; Claveau et al., 2014). We show that α can be used to characterize difficult queries; moreover, in a query expansion task, we show that this intrinsic dimensionality notion, applied to words, helps to choose the terms to expand and their expansions.
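One plausible instantiation of such a local intrinsic dimensionality coefficient is the maximum-likelihood estimator used in that literature, which needs only the distances from a query to its k nearest neighbours; whether this is exactly the α of the paper is an assumption here.

```python
from math import log

def local_intrinsic_dim(distances):
    """MLE estimate of local intrinsic dimensionality from the (nonzero)
    distances of one query point to its k nearest neighbours."""
    r = sorted(distances)
    r_max = r[-1]
    s = sum(log(d / r_max) for d in r[:-1])
    return -(len(r) - 1) / s

d = local_intrinsic_dim([0.2, 0.4, 0.5, 0.8, 1.0])  # low value: locally low-dimensional
```

Intuitively, the faster neighbour distances grow with rank, the lower the estimate; queries whose neighbourhoods look high-dimensional tend to be harder to separate.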

    PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided MCTS Decoding

    Large language models (LMs) based on Transformers can generate plausible long texts. In this paper, we explore how this generation can be further controlled at decoding time to satisfy certain constraints (e.g. being non-toxic, conveying certain emotions, using a specific writing style, etc.) without fine-tuning the LM. Precisely, we formalize constrained generation as a tree exploration process guided by a discriminator that indicates how well the associated sequence respects the constraint. This approach, in addition to being easier and cheaper to train than fine-tuning the LM, allows the constraint to be applied more finely and dynamically. We propose several original methods to search this generation tree, notably Monte Carlo Tree Search (MCTS), which provides theoretical guarantees on the search efficiency, but also simpler methods based on re-ranking a pool of diverse sequences using the discriminator scores. These methods are evaluated, with automatic and human-based metrics, on two types of constraints and two languages: review polarity and emotion control in French and English. We show that discriminator-guided MCTS decoding achieves state-of-the-art results in both tasks and languages without having to tune the language model. We also demonstrate that the other proposed decoding methods, based on re-ranking, can be very effective when diversity among the generated propositions is encouraged.
    Comment: 15 pages, 5 tables, 7 figures, accepted to NAACL 202
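The simplest of the decoding strategies described above, re-ranking, can be sketched as follows: sample a pool of diverse sequences from the LM, then keep the one the discriminator scores highest. `sample_lm` and `discriminator` below are toy stand-ins, not real models.

```python
def rerank(prompt, sample_lm, discriminator, pool_size=6):
    """Sample pool_size candidate continuations, return the one that
    best satisfies the constraint according to the discriminator."""
    candidates = [sample_lm(prompt) for _ in range(pool_size)]
    return max(candidates, key=discriminator)

# toy stand-ins: a fixed pool of "sampled" texts and a keyword discriminator
pool = iter(["it was awful", "what a happy day", "meh", "so sad",
             "quite nice", "terrible film"])
best = rerank("Today:", lambda p: next(pool), lambda s: s.count("happy"))
print(best)  # -> "what a happy day"
```

MCTS replaces this one-shot pool with an incremental tree search in which discriminator scores of rolled-out sequences are backed up to guide which token branches to expand next.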

    Detecting fake news in tweets from text and propagation graph: IRISA's participation to the FakeNews task at MediaEval 2020

    International audience. This paper presents the participation of IRISA in the task of fake news detection from tweets, relying either on the text or on propagation information. For text-based detection, variants of BERT-based classification are proposed. To improve this standard approach, we investigate the interest of augmenting the dataset by creating tweets with fine-tuned generative models. For graph-based detection, we propose models characterizing the propagation of the news or the users' reputation.

    Speculation and negation detection in french biomedical corpora

    International audience. In this work, we address the detection of negation and speculation, and of their scope, in French biomedical documents. It has indeed been observed that they play an important role and provide crucial clues for other NLP applications. Our methods are based on CRFs and BiLSTMs. We reach up to 97.21% and 91.30% F-measure for the detection of negation and speculation cues, respectively, using CRFs. For scope computation, we reach up to 90.81% and 86.73% F-measure on negation and speculation, respectively, using a BiLSTM-CRF fed with word embeddings.
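As a point of reference for what the CRF/BiLSTM models above improve on, a naive baseline detects cues from a small lexicon and takes the scope as the tokens following the cue up to the next punctuation mark. Both the lexicon and the example are invented.

```python
NEG_CUES = {"no", "not", "without", "never"}

def negation_scopes(tokens):
    """Return (cue_index, scope_tokens) pairs using a lexicon of cues and
    a punctuation-bounded scope heuristic."""
    scopes = []
    for i, tok in enumerate(tokens):
        if tok.lower() in NEG_CUES:
            j = i + 1
            while j < len(tokens) and tokens[j] not in {".", ",", ";"}:
                j += 1
            scopes.append((i, tokens[i + 1:j]))
    return scopes

toks = "patient shows no sign of infection , afebrile .".split()
scopes = negation_scopes(toks)  # -> [(2, ['sign', 'of', 'infection'])]
```

Sequence models replace both hard-coded parts: the cue lexicon becomes a learned tagger, and the punctuation heuristic becomes learned scope boundaries.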

    Measuring vagueness and subjectivity in texts: from symbolic to neural VAGO

    We present a hybrid approach to the automated measurement of vagueness and subjectivity in texts. We first introduce the expert system VAGO, illustrate it on a small benchmark of fact vs. opinion sentences, and then test it on the larger French press corpus FreSaDa to confirm the higher prevalence of subjective markers in satirical vs. regular texts. We then build a neural clone of VAGO, based on a BERT-like architecture, trained on the symbolic VAGO scores obtained on FreSaDa. Using explainability tools (LIME), we show the interest of this neural version for enriching the lexicons of the symbolic version and for producing versions in other languages.
    Comment: Paper to appear in the Proceedings of the 2023 IEEE International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)
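A minimal sketch of a symbolic, lexicon-based score in the spirit of VAGO: the fraction of tokens belonging to a vagueness/subjectivity lexicon. The tiny lexicon and the sentences are illustrative, not the actual VAGO resources.

```python
VAGUE_LEXICON = {"maybe", "somewhat", "huge", "terrible", "roughly", "good"}

def vago_score(sentence):
    """Fraction of tokens that are vagueness/subjectivity markers."""
    tokens = sentence.lower().split()
    return sum(t in VAGUE_LEXICON for t in tokens) / len(tokens)

fact = "the meeting starts at 9 am"
opinion = "it was a somewhat terrible and huge mess"
# opinion sentences score higher than factual ones
```

The neural clone described above learns to reproduce such scores from raw text, which is what lets LIME-style attributions surface candidate markers missing from the lexicon.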

    Supervised Machine Learning Techniques to Detect TimeML Events in French and English

    International audience. Identifying events in texts is an information extraction task necessary for many NLP applications. Through the TimeML specifications and TempEval challenges, it has received some attention in recent years; yet, no reference result is available for French. In this paper, we try to fill this gap by proposing several event extraction systems, combining, for instance, Conditional Random Fields, language modeling and k-nearest neighbors. These systems are evaluated on French corpora and compared with state-of-the-art methods on English. The very good results obtained on both languages validate our whole approach.
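One of the components mentioned above, a k-nearest-neighbors tagger, can be illustrated on tokens with a crude character-overlap distance; the features, distance, and training data here are invented for the example, not those of the paper.

```python
from collections import Counter

def knn_tag(token, train, k=3):
    """Tag a token by majority vote among the k closest training tokens;
    `train` is a list of (token, label) pairs."""
    def dist(a, b):
        # smaller is closer: shared characters minus length mismatch
        return -len(set(a) & set(b)) + abs(len(a) - len(b))
    nearest = sorted(train, key=lambda tl: dist(token, tl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [("explosion", "EVENT"), ("meeting", "EVENT"), ("election", "EVENT"),
         ("table", "O"), ("paris", "O"), ("blue", "O")]
print(knn_tag("elections", train))  # -> "EVENT"
```

In a realistic system the distance would operate on richer token features (lemma, part of speech, context words) rather than raw character sets.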